Tram Duong
October 24, 2020
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import numpy as np
import scipy
from scipy import stats
sns.set(rc={'figure.figsize':(10,15)})
# read data
payment_data = pd.read_csv("C:/Github/inpatientCharges.csv")
A few observations from the data overview:
payment_data.head()
payment_data.describe()
payment_data.info()
According to the link: https://data.cms.gov/Medicare-Inpatient/National-Summary-of-Inpatient-Charge-Data-by-Medic/efwk-h4x3, the dataset description is below:
DRG Definition : Classification system that groups similar clinical conditions (diagnoses) and the procedures furnished by the hospital during the stay.
Total Discharges : The number of discharges billed by all providers for inpatient hospital services.
Average Covered Charges : The average charge of all provider's services covered by Medicare for discharges in the DRG. These will vary from hospital to hospital because of differences in hospital charge structures.
Average Total Payment: The average total payments to all providers for the DRG including the MS-DRG amount, teaching, disproportionate share, capital, and outlier payments for all cases. Also included in average total payments are co-payment and deductible amounts that the patient is responsible for and any additional payments by third parties for coordination of benefits.
Average Medicare Payment: The average amount that Medicare pays to the provider for Medicare's share of the MS-DRG. Medicare payment amounts include the MS-DRG amount, teaching, disproportionate share, capital, and outlier payments for all cases. Medicare payments DO NOT include beneficiary co-payments and deductible amounts nor any additional payments from third parties for coordination of benefits.
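From these definitions, Average Total Payments minus Average Medicare Payments gives the share paid by the patient (co-payments and deductibles) plus any third-party payments. A tiny sketch with hypothetical numbers (not from the dataset):

```python
# Hypothetical DRG-level averages (illustration only, not from the dataset)
avg_total_payments = 9000.0
avg_medicare_payments = 7500.0

# Patient cost-sharing plus third-party payments, implied by the definitions above
out_of_pocket_and_third_party = avg_total_payments - avg_medicare_payments
print(out_of_pocket_and_third_party)  # 1500.0

# Share of the total that Medicare covers
medicare_coverage_ratio = avg_medicare_payments / avg_total_payments
print(round(medicare_coverage_ratio, 3))  # 0.833
```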
Some column names have leading or trailing spaces that need to be removed.
payment_data.columns
payment_data.columns = payment_data.columns.str.strip()
All the payment columns include a '$' sign, which needs to be removed so the values can be converted to float type for further analysis.
# remove $ sign and convert to float type
payment_data['Average Covered Charges'] = payment_data['Average Covered Charges'].str.strip("$").astype('float')
payment_data['Average Total Payments'] = payment_data['Average Total Payments'].str.strip("$").astype('float')
payment_data['Average Medicare Payments'] = payment_data['Average Medicare Payments'].str.strip("$").astype('float')
The zip code column contains some four-digit values that are missing the leading zero and need to be converted to the correct format.
payment_data['Provider Zip Code'] = payment_data['Provider Zip Code'].astype(str).str.zfill(5)
The dataset contains inpatient payments across 50 states. Besides making visualizations comparing charges across states, I plan to build some region-based visualizations to gain more insight.
west = ['WA','OR','CA','ID','NV','MT','WY','UT','AZ','CO','NM']
midwest = ['ND','MN','WI','MI','SD','NE','KS','IA','MO','IL','IN','OH']
south = ['TX','OK','AR','LA','MS','TN','KY','AL','GA','FL','SC','NC','VA','WV','MD','DE']
northeast = ['PA','NJ','NY','CT','MA','RI','VT','NH','ME']
s=pd.DataFrame([west,midwest,south,northeast],index=['West','Midwest','South','Northeast'])
s=s.reset_index().melt('index')
payment_data['Region'] = payment_data['Provider State'].map(dict(zip(s['value'],s['index'])))
payment_data.head()
# Make the PairGrid
g = sns.PairGrid(payment_data,
                 x_vars=['Total Discharges', 'Average Covered Charges', 'Average Total Payments', 'Average Medicare Payments'],
                 y_vars=["Provider State"],
                 height=10, aspect=.25)
# Draw a dot plot using the stripplot function
g.map(sns.stripplot, size=10, orient="h", linewidth=1)
# Use the same x axis limits on all columns and add better labels
g.set(xlabel="", ylabel="")
# Use semantically meaningful titles for the columns
titles = ["Total Discharges", "Average Covered Charges", "Average Total Payments",
          "Average Medicare Payments"]
for ax, title in zip(g.axes.flat, titles):
    # Set a different title for each axes
    ax.set(title=title)
    # Make the grid horizontal instead of vertical
    ax.xaxis.grid(False)
    ax.yaxis.grid(True)
plt.tight_layout()
plt.show()
sns.pairplot(payment_data[['Region', 'Average Total Payments',
                           'Total Discharges', 'Average Medicare Payments', 'Average Covered Charges']],
             hue='Region', height=4)
stats_df = pd.DataFrame(payment_data, columns=['Average Total Payments',
'Total Discharges','Average Medicare Payments','Average Covered Charges'])
x = stats_df
corr = x.corr()
mask = np.zeros_like(corr, dtype=bool)  # np.bool is deprecated; use the builtin
mask[np.triu_indices_from(mask)] = True
cmap = sns.diverging_palette(220, 10, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
f, ax = plt.subplots(figsize=(10, 10))
sns.heatmap(corr, mask=mask, cmap=cmap, center=0,
square=True, linewidths=.5, cbar_kws={"shrink": .5})
The average Medicare payments and average covered charges are highly correlated with each other. Furthermore, the average Medicare payments are also highly correlated with the average total payments.
fig = plt.figure(figsize=(16,8))
common_drg = payment_data.groupby('DRG Definition').count()['Provider Id'].sort_values(ascending=False)
top_10 = common_drg[:10]
sns.countplot(y='DRG Definition', data=payment_data, palette="Greens_d",
              order=payment_data['DRG Definition'].value_counts().iloc[:10].index)
This feature provides an estimate of the average charges/payments billed by each provider.
# group by provider id
patient_avg_id = payment_data.groupby('Provider Id')[['Total Discharges',
    'Average Covered Charges', 'Average Total Payments', 'Average Medicare Payments']].mean()
patient_avg_id.head()
Conclusion: This feature shows the average charge/payment for each provider, which can be used as a baseline when an unusual amount appears. This insight highlights the business use of this feature and shows how it helps with detecting anomalies.
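To make the "baseline" idea concrete, here is a minimal sketch with toy data (hypothetical providers and amounts) showing how a per-provider mean can flag a row that deviates sharply from that provider's own average; the 50% deviation threshold is an arbitrary assumption for illustration:

```python
import pandas as pd

# Toy data standing in for payment_data (hypothetical providers and amounts)
df = pd.DataFrame({
    'Provider Id': [1, 1, 1, 2, 2],
    'Average Total Payments': [5000.0, 5000.0, 20000.0, 8000.0, 8100.0],
})

# Per-provider baseline, analogous to patient_avg_id above
baseline = df.groupby('Provider Id')['Average Total Payments'].transform('mean')

# Flag rows that deviate more than 50% from their provider's own average
df['flag'] = (df['Average Total Payments'] - baseline).abs() / baseline > 0.5
print(df['flag'].tolist())  # [False, False, True, False, False]
```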
This feature provides an estimate of the average charges/payments in each state.
patient_avg_state = payment_data.groupby('Provider State')[['Total Discharges',
    'Average Covered Charges', 'Average Total Payments', 'Average Medicare Payments']].mean()
patient_avg_state.head()
fig, axes = plt.subplots(nrows=4, ncols=1, figsize=(15, 10))
plt.subplots_adjust(hspace=1)
for i, ax in enumerate(axes.flatten()):
    patient_avg_state[patient_avg_state.columns[i]].plot(kind='bar', ax=ax)
    ax.set_title(patient_avg_state.columns[i])
Conclusion: This feature provides an estimate of the average charges/payments in each state, which can be used to compare states and serve as a baseline when an unusual amount appears. This insight highlights the business use of this feature and shows how it helps with detecting anomalies.
This feature displays the average charges/payments in each region.
patient_avg_reg = payment_data.groupby('Region')[['Total Discharges',
    'Average Covered Charges', 'Average Total Payments', 'Average Medicare Payments']].mean()
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(15, 10))
plt.subplots_adjust(hspace=0.5)
for i, ax in enumerate(axes.flatten()):
    patient_avg_reg[patient_avg_reg.columns[i]].plot(kind='bar', ax=ax)
    ax.set_title(patient_avg_reg.columns[i])
Conclusion: This feature provides an estimate of the average charges/payments in each region, which can be used to compare regions and serve as a baseline when an unusual amount appears in a region. This insight highlights the business use of this feature and shows how it helps with detecting anomalies.
This feature is the amount that patients pay out of pocket at different providers. It gives us an idea of which providers have the greatest charges.
payment_data['Ave Out of Pocket Payment'] = payment_data['Average Total Payments'] - payment_data['Average Medicare Payments']
oop_pro= payment_data[['Provider Name', 'Ave Out of Pocket Payment']].groupby(by='Provider Name').agg('mean')
oop_pro = oop_pro.sort_values(('Ave Out of Pocket Payment'), ascending=False)
oop_pro.head()
%%capture --no-stdout --no-display output
#Stop warning from showing
# Top 20 out of pocket per provider
a = oop_pro[:20]
fig, ax = plt.subplots(figsize=(30, 20))
sns.barplot(x=a.iloc[:, 0], y=a.index, color="steelblue", alpha=0.8, ax=ax)
ax.set(ylabel=None)
plt.xlabel("Ave Out of Pocket Payment")
Conclusion: Out-of-pocket cost is an important indicator on any hospital bill. This feature gives the mean out-of-pocket payment for each provider and can be used as a baseline for comparison when unusual out-of-pocket charges occur.
This feature is the amount that patients pay out of pocket for different procedures. It gives us an estimate of the cost of different procedures.
oop_drg= payment_data[['DRG Definition', 'Ave Out of Pocket Payment']].groupby(by='DRG Definition').agg('mean')
oop_drg = oop_drg.sort_values(('Ave Out of Pocket Payment'), ascending=False)
oop_drg.head()
%%capture --no-stdout --no-display output
# prevent warning message from showing
# Top 20 out of pocket by procedure
a = oop_drg[:20]
fig, ax = plt.subplots(figsize=(30, 20))
sns.barplot(x=a.iloc[:, 0], y=a.index, color="steelblue", alpha=0.8, ax=ax)
ax.set(ylabel=None)
plt.xlabel("Ave OoP by Procedure")
Conclusion: This feature gives the mean out-of-pocket payment for each procedure and can be used as a baseline to detect anomalies.
This feature shows the average out-of-pocket amount per discharge for different procedures. If unusually high charges occur, they would be captured and could be investigated if needed.
payment_data['Ave OoP per discharge'] = payment_data['Ave Out of Pocket Payment']/payment_data['Total Discharges']
oop_dis= payment_data[['DRG Definition', 'Ave OoP per discharge']].groupby(by='DRG Definition').agg('mean')
oop_dis = oop_dis.sort_values(('Ave OoP per discharge'), ascending=False)
oop_dis.head()
%%capture --no-stdout --no-display output
# prevent warning message from showing
# Top 20 out of pocket per discharge
a = oop_dis[:20]
fig, ax = plt.subplots(figsize=(30, 20))
sns.barplot(x=a.iloc[:, 0], y=a.index, color="steelblue", alpha=0.8, ax=ax)
ax.set(ylabel=None)
plt.xlabel("Ave OoP per discharge")
Conclusion: This feature gives the mean out-of-pocket payment per discharge for each procedure and can be used as a baseline to detect anomalies. If a discharge's payment differs greatly from the mean for a specific procedure, it would be noticeable.
This feature displays the proportion of the total payment compared to covered charge.
payment_data['Percent of Payment Covered'] = round((payment_data['Average Total Payments'] /
payment_data['Average Covered Charges'])*100,2)
payment_data['Percent of Payment Covered'].head()
pc_per= payment_data[['DRG Definition', 'Percent of Payment Covered']].groupby(by='DRG Definition').agg('mean')
pc_per = pc_per.sort_values(('Percent of Payment Covered'), ascending=False)
pc_per.head()
%%capture --no-stdout --no-display output
# prevent warning message from showing
# Top 20 highest percent of payment covered
a = pc_per[:20]
fig, ax = plt.subplots(figsize=(30, 20))
sns.barplot(x=a.iloc[:, 0], y=a.index, color="steelblue", alpha=0.8, ax=ax)
ax.set(ylabel=None)
plt.xlabel("Percent of Payment Covered")
Conclusion: This feature gives the percent of payment covered and can be used as a baseline to detect anomalies. If an unusual payment percentage occurs for a procedure, it would be noticeable.
This feature calculates the proportion covered by Medicare for different procedures.
payment_data['Medicare Coverage Ratio'] = (payment_data['Average Medicare Payments'] / payment_data['Average Total Payments'])
med_cv = payment_data[['DRG Definition', 'Medicare Coverage Ratio']].groupby(by='DRG Definition').agg('mean')
med_cv = med_cv.sort_values(('Medicare Coverage Ratio'), ascending=True)
med_cv.head()
%%capture --no-stdout --no-display output
# prevent warning message from showing
# 20 lowest Medicare coverage ratios (sorted ascending)
a = med_cv[:20]
fig, ax = plt.subplots(figsize=(20, 20))
sns.barplot(x=a.iloc[:, 0], y=a.index, color="steelblue", alpha=0.8, ax=ax)
ax.set(ylabel=None)
plt.xlabel("Medicare Coverage Ratio")
Conclusion: This feature gives the percent of Medicare coverage for each procedure and can be used as a baseline to detect anomalies. If an unusual percentage of Medicare charges occurs for a procedure, it would be noticeable.
This feature displays the average percentage covered by Medicare in each state.
med_cv_state = payment_data[['Provider State', 'Medicare Coverage Ratio']].groupby(by='Provider State').agg('mean')
med_cv_state = med_cv_state.sort_values(('Medicare Coverage Ratio'), ascending=False)
med_cv_state.head()
%%capture --no-stdout --no-display output
# prevent warning message from showing
fig, ax = plt.subplots()
sns.barplot(x=med_cv_state.iloc[:, 0], y=med_cv_state.index, color="steelblue", alpha=0.8, ax=ax)
ax.set(ylabel=None)
plt.xlabel("Medicare Coverage Ratio")
Conclusion: This feature gives the percent of Medicare coverage in each state and can be used as a baseline to detect anomalies. If an unusual Medicare percentage occurs, it would be noticeable.
Z-scores can quantify the unusualness of an observation when your data follow the normal distribution. Z-scores are the number of standard deviations above and below the mean that each value falls. For example, a Z-score of 2 indicates that an observation is two standard deviations above the average while a Z-score of -2 signifies it is two standard deviations below the mean. A Z-score of zero represents a value that equals the mean.
Values with an absolute Z-score greater than 3 are outliers/anomalies, as they fall outside 99.7% of the data.
This feature calculates the z-score for average total payment
payment_data['Z-score Average Total Payments'] = stats.zscore(payment_data['Average Total Payments'])
payment_data['Z-score Average Total Payments'].max()
Conclusion: Values with an absolute Z-score greater than 3 are outliers/anomalies, as they fall outside 99.7% of the data. This feature helps detect outliers in total payments. The example above shows a Z-score of 19.10 in the average total payments, which needs to be investigated.
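To act on the |Z| > 3 rule rather than just inspecting the maximum, the same scipy call can count the flagged rows. A toy sketch with one planted anomaly (all values hypothetical):

```python
import numpy as np
from scipy import stats

# Toy payment values with one planted anomaly (hypothetical numbers)
payments = np.array([100.0] * 20 + [1000.0])
z = stats.zscore(payments)

# Rows whose absolute Z-score exceeds 3 are flagged as outliers
outliers = np.abs(z) > 3
print(outliers.sum())  # 1
```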
This feature calculates the z-score for average medicare payment
payment_data['Z-score Average Medicare Payments'] = stats.zscore(payment_data['Average Medicare Payments'])
payment_data['Z-score Average Medicare Payments'].max()
Conclusion: Values with an absolute Z-score greater than 3 are outliers/anomalies, as they fall outside 99.7% of the data. This feature helps detect outliers in Medicare payments. The example above shows a Z-score of 19.99 in the average Medicare payments, which needs to be investigated.
This feature calculates the average cost and the number of cases for each procedure.
ave_cv = payment_data[['DRG Definition', 'Average Covered Charges']].groupby(by='DRG Definition').agg(['mean','count'])
ave_cv = ave_cv.sort_values(('Average Covered Charges', 'mean'), ascending=False)
ave_cv.head()
# Top 10 highest Average Covered Charges
a = ave_cv[:10]
%%capture --no-stdout --no-display output
# prevent warning message from showing
fig, ax = plt.subplots(figsize=(15, 15))
sns.barplot(x=a.iloc[:, 0], y=a.index, color="steelblue", alpha=0.8, ax=ax)
ax.set(ylabel=None)
plt.xlabel("Average Covered Charges ($)")
Conclusion: This feature provides insight into hospital bills for different procedures. From that, we can detect unusual charges for a specific procedure.
This feature calculates the total number of cases of each procedure in each region.
common_drug = payment_data[['Region', 'DRG Definition']].groupby(by=['Region','DRG Definition']).agg({'DRG Definition': 'count'})
common_drug[:20]
Conclusion: This feature can be used to identify common and non-common procedures in each region, giving us an idea of which procedures are most common where. We can then take a closer look at the different price points by provider for common and non-common procedures in order to detect fraud.
This feature is the difference between the maximum and minimum payments for each procedure.
differences = payment_data[['DRG Definition','Average Total Payments']].groupby(by='DRG Definition').agg(['max','min'])
differences['Difference'] = differences[('Average Total Payments','max')] - differences[('Average Total Payments','min')]
differences = differences.sort_values(by='Difference', ascending=False)[:20]
# the results are limited to the 20 largest differences, but the range can be adjusted to include as many or as few as needed
%%capture --no-stdout --no-display output
# prevent warning message from showing
sns.set_context("paper")
fig, ax = plt.subplots(figsize=(15, 15))
sns.barplot(x=differences["Difference"], y=differences.index, color="steelblue", alpha=0.8, ax=ax)
ax.set(ylabel=None)
plt.xlabel("Differences in Average Total Payment ($)")
Conclusion: This feature helps identify the payment spread for the same procedure. If the difference is high for a procedure, the payment varies greatly between states or providers, and these procedures should be investigated further.
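Once a procedure shows a large max-min spread, the row that set the maximum can be pulled out for investigation. A minimal sketch with toy data (hypothetical DRGs, providers, and amounts):

```python
import pandas as pd

# Toy data: DRG 'A' has a large payment spread driven by provider P2
df = pd.DataFrame({
    'DRG Definition': ['A', 'A', 'B', 'B'],
    'Provider Name': ['P1', 'P2', 'P3', 'P4'],
    'Average Total Payments': [1000.0, 9000.0, 2000.0, 2100.0],
})

# Row index of the maximum payment within each DRG
idx = df.groupby('DRG Definition')['Average Total Payments'].idxmax()

# Providers that set the maximum for each procedure
print(df.loc[idx, 'Provider Name'].tolist())  # ['P2', 'P4']
```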
In this part, I use K-means clustering algorithm to explore the dataset, following the steps below:
- Drop irrelevant variables
- Standardization of Numerical / Float variables
payment_data = payment_data.drop(columns = ['Provider Id'])
features = payment_data
%%capture --no-stdout --no-display output
features = features.merge(differences, on = "DRG Definition", how = "left")
# drop the max and min columns here and keep the difference
features.head()
features.rename(columns={('Average Total Payments', 'max'): 'max',
                         ('Average Total Payments', 'min'): 'min',
                         ('Difference', ''): 'Differences'}, inplace=True)
features.drop(['min', 'max'], axis=1, inplace=True)
# keep this merge: the dataframe so far only has the row-level Medicare coverage ratio column;
## med_cv_state adds the average coverage ratio by state.
features = features.merge(med_cv_state, on = "Provider State", how = "left")
features = features.merge(patient_avg_state, on = "Provider State", how= "left")
# Rename the last 4 columns as Ave by state
features.head()
features = features.rename(columns={"Total Discharges_x": "Total_Discharges",
"Average Covered Charges_x": "Ave_Covered_Charges",
"Average Total Payments_x": "Ave_Total_Payments",
"Average Medicare Payments_x":"Aver_Medicare_Payments",
"Ave Out of Pocket Payment":"Ave_OOP",
"Percent of Payment Covered" : "Prop_payment_covered",
"Medicare Coverage Ratio_x":"Medicare_Coverage_Ratio",
"Z-score Average Total Payments": "Zscore_ave_total_payment",
"Z-score Average Medicare Payments": "Zscore_ave_medicare_payment",
"Medicare Coverage Ratio_y": "Medicare_coverage_ratio_bystate",
"Total Discharges_y":"Mean_Total_Discharge_bystate",
"Average Covered Charges_y":"Mean_Ave_Covered_bystate",
"Average Total Payments_y":"Mean_Ave_Total_Payment_bystate",
"Average Medicare Payments_y":"Mean_Ave_Medicare_bystate"})
K-means only works with numerical columns, and the categorical columns (DRG definition, provider ID, provider name, provider city, provider state, provider zip code, hospital referral region) each have more than 50 unique values. One-hot encoding them would therefore not meaningfully support clustering: it would create many new binary features, which K-means does not handle well. For these reasons, all categorical columns are dropped.
categorical_col = features.loc[:, features.dtypes == object]  # np.object is deprecated; use the builtin
a = categorical_col.columns
features = features.drop(columns = a)
features.columns
features.isnull().sum()
The Differences column has a large share of missing values, and imputing them would not have a meaningful impact on the model, so the column is dropped.
features['Differences'].isnull().sum()/len(features)
features = features.drop(columns = ['Differences'])
corr = features.corr()
mask = np.zeros_like(corr, dtype=bool)  # np.bool is deprecated; use the builtin
mask[np.triu_indices_from(mask)] = True
cmap = sns.diverging_palette(220, 10, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
f, ax = plt.subplots(figsize=(15, 15))
sns.heatmap(corr, mask=mask, cmap=cmap, center=0,
square=True, linewidths=.5, cbar_kws={"shrink": .5})
In general, features that are highly correlated do not improve model efficiency; they also hurt the performance of linear regression and random forest models and make learning algorithms slower to create and train. Therefore, I removed highly correlated features to prevent multicollinearity through the following function:
# Function to remove columns with high correlation value
def correlation(dataset, threshold):
    """
    Remove columns that exceed the correlation threshold.
    """
    col_corr = set()  # Set of all the names of deleted columns
    corr_matrix = dataset.corr()
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if (corr_matrix.iloc[i, j] >= threshold) and (corr_matrix.columns[j] not in col_corr):
                colname = corr_matrix.columns[i]  # getting the name of the column
                col_corr.add(colname)
                if colname in dataset.columns:
                    del dataset[colname]  # deleting the column from the dataset
    return dataset
features = correlation(features, 0.8)
features.columns
Total discharges vary depending on the hospital's location, size, and treatments provided, so this feature would not support model performance and might hurt accuracy. Thus, I drop both discharge features before further analysis.
features = features.drop(columns = ['Total_Discharges','Mean_Total_Discharge_bystate'])
from sklearn.model_selection import train_test_split as tts
from sklearn.preprocessing import StandardScaler
X_train, X_test = tts(features, test_size=0.30)
# Fit the scaler on the training set only, then apply it to both splits
# to avoid leaking test-set statistics into the transformation
scaler = StandardScaler().fit(X_train)
X_train = pd.DataFrame(scaler.transform(X_train))
X_test = pd.DataFrame(scaler.transform(X_test))
In this part, I will use autoencoder and isolation forest algorithms to detect outliers. I will try three different models, plus the average and maximization score-combination methods, for both approaches. Furthermore, feature selection plays an important role in the success of a model; beyond their statistical support, it is important to understand the features' business insights and how they affect the model. Below is a brief explanation of the selected features and why they are important:
Autoencoders can perform non-linear transformations with their non-linear activation functions and multiple layers. It is more efficient to train several layers with an autoencoder than to train one huge transformation with PCA.
An autoencoder is an unsupervised artificial neural network that copies the input values (input layer) to the output values (output layer) through several hidden layers. The hidden layers in an autoencoder must have fewer dimensions than the input and output layers. If the number of neurons exceeds that of the input and output layers, the model is given too much capacity to learn the data and simply passes the input values through as output, without extracting any essential information.
from pyod.models.auto_encoder import AutoEncoder
clf1 = AutoEncoder(hidden_neurons =[9, 2, 2, 9])
clf1.fit(X_train)
# Get the outlier scores for the train data
y_train_scores_1 = clf1.decision_scores_
# Predict the anomaly scores
y_test_scores_1 = clf1.decision_function(X_test) # outlier scores
y_test_scores_1 = pd.Series(y_test_scores_1)
# Plot it!
import matplotlib.pyplot as plt
plt.figure(figsize=(30,20))
plt.hist(y_test_scores_1, bins='auto', lw=0, color='mediumvioletred')
plt.xticks(size = 15)
plt.yticks(size = 15)
plt.title("Histogram for Model Clf1 Anomaly Scores",fontsize = 20)
plt.show()
Using the histogram to count the frequency of each anomaly score shows that the higher the score, the lower the frequency, which points to the group of outliers. As the chart shows the frequency dropping off around 8 to 10, I choose 8 as the cut point and treat scores >= 8 as outliers.
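As an alternative to eyeballing the histogram, the cut point could be set from a score quantile. A sketch with synthetic scores; the 1% contamination rate and the exponential score distribution are both assumptions for illustration, not something this notebook establishes:

```python
import numpy as np

# Synthetic anomaly scores standing in for the model's test scores
rng = np.random.default_rng(0)
scores = rng.exponential(scale=1.0, size=1000)

# Treat the top 1% of scores as outliers (assumed contamination rate)
cut = np.quantile(scores, 0.99)
n_outliers = (scores >= cut).sum()
print(n_outliers)  # 10
```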
#Get the Summary Statistics by Cluster
df_test = X_test.copy()
old_col = df_test.columns[:9]
feature_name = features.columns
df_test.rename(columns=dict(zip(old_col, feature_name)), inplace=True)
df_test['score'] = y_test_scores_1
df_test['cluster'] = np.where(df_test['score']<8, 'Normal', 'Outlier')
df_test['cluster'].value_counts()
df_test.groupby('cluster').mean()
sns.pairplot(df_test.iloc[:,[1,2,3,4,9]], hue = 'cluster', palette='husl')
Note: From the summary statistics table above, the difference in average anomaly score between the outlier cluster and the normal cluster is dramatically large (a 6x difference). Additionally, the average values of the other features also show big differences between the two clusters.
clf2 = AutoEncoder(hidden_neurons =[9, 10,2, 10, 9])
clf2.fit(X_train)
# Predict the anomaly scores
y_test_scores_2 = clf2.decision_function(X_test)
y_test_scores_2 = pd.Series(y_test_scores_2)
# Plot the histogram
import matplotlib.pyplot as plt
plt.hist(y_test_scores_2, bins='auto', lw=0, color='mediumvioletred')
plt.xticks(size = 20)
plt.yticks(size = 20)
plt.title("Histogram for Model Clf2 Anomaly Scores",fontsize = 20)
plt.show()
Using the histogram to count the frequency of each anomaly score shows that the higher the score, the lower the frequency, which points to the group of outliers. As the chart shows the frequency dropping off around 8 to 10, I choose 8 as the cut point and treat scores >= 8 as outliers.
#Get the Summary Statistics by Cluster
df_test = X_test.copy()
old_col = df_test.columns[:9]
feature_name = features.columns
df_test.rename(columns=dict(zip(old_col, feature_name)), inplace=True)
df_test['score'] = y_test_scores_2
df_test['cluster'] = np.where(df_test['score']<8, 'Normal', 'Outlier')
df_test['cluster'].value_counts()
df_test.groupby('cluster').mean()
sns.pairplot(df_test.iloc[:,[1,2,3,4,9]], hue = 'cluster', palette='husl')
Note: The results of model 2 are similar to model 1. From the summary statistics table above, the difference in average anomaly score between the outlier cluster and the normal cluster is dramatically large (a 6x difference). Additionally, the average values of the other features also show big differences between the two clusters.
# Step 1: Build the model
clf3 = AutoEncoder(hidden_neurons =[9, 15, 10, 2, 10,15,9])
clf3.fit(X_train)
# Predict the anomaly scores
y_test_scores_3 = clf3.decision_function(X_test)
y_test_scores_3 = pd.Series(y_test_scores_3)
# Step 2: Determine the cut point
import matplotlib.pyplot as plt
plt.hist(y_test_scores_3, bins='auto', lw=0, color='mediumvioletred')
plt.xticks(size = 15)
plt.yticks(size = 15)
plt.title("Histogram for Model Clf3 Anomaly Scores",fontsize = 20)
plt.show()
Using the histogram to count the frequency of each anomaly score shows that the higher the score, the lower the frequency, which points to the group of outliers. As the chart shows the frequency dropping off around 8 to 10, I choose 8 as the cut point and treat scores >= 8 as outliers.
#Get the Summary Statistics by Cluster
df_test = X_test.copy()
old_col = df_test.columns[:9]
feature_name = features.columns
df_test.rename(columns=dict(zip(old_col, feature_name)), inplace=True)
df_test['score'] = y_test_scores_3
df_test['cluster'] = np.where(df_test['score']<8, 'Normal', 'Outlier')
df_test['cluster'].value_counts()
df_test.groupby('cluster').mean()
sns.pairplot(df_test.iloc[:,[1,2,3,4,9]], hue = 'cluster', palette='husl')
Note: The results of model 3 are similar to models 1 and 2, including the number of outliers and the summary statistics. The difference in average anomaly score between the outlier cluster and the normal cluster is dramatically large (a 6x difference), and the average values of the other features also show big differences between the two clusters.
Although unsupervised techniques are powerful in detecting outliers, they are prone to overfitting and unstable results. The solution is to train multiple models and then aggregate the scores.
%%capture --no-stdout --no-display output
#Stop warning from showing
from pyod.models.combination import aom, moa, average, maximization
# Put all the predictions in a data frame
train_scores = pd.DataFrame({'clf1': clf1.decision_scores_,
'clf2': clf2.decision_scores_,
'clf3': clf3.decision_scores_
})
test_scores = pd.DataFrame({'clf1': clf1.decision_function(X_test),
'clf2': clf2.decision_function(X_test),
'clf3': clf3.decision_function(X_test)
})
from pyod.utils.utility import standardizer
# Although we did standardization before, it was for the variables.
# Now we do the standardization for the decision scores
train_scores_norm, test_scores_norm = standardizer(train_scores,test_scores)
# Combination by average
y_by_average = average(test_scores_norm)
import matplotlib.pyplot as plt
plt.hist(y_by_average, bins='auto', lw=0, color='mediumvioletred')
plt.xticks(size = 15)
plt.yticks(size = 15)
plt.title("Histogram for Average Method",fontsize = 20)
plt.show()
Using the histogram to count the frequency of each anomaly score shows that the higher the score, the lower the frequency, which points to the group of outliers. As the chart shows the frequency dropping off somewhere below 5, I choose 2 as the cut point and treat scores >= 2 as outliers.
df_test = X_test.copy()
old_col = df_test.columns[:9]
feature_name = features.columns
df_test.rename(columns=dict(zip(old_col, feature_name)), inplace=True)
df_test['y_by_average_score'] = y_by_average
df_test['y_by_average_cluster'] = np.where(df_test['y_by_average_score']<2,'Normal', 'Outlier')
df_test['y_by_average_cluster'].value_counts()
df_test.groupby('y_by_average_cluster').mean()
sns.pairplot(df_test.iloc[:,[1,2,3,4,9]], hue = 'y_by_average_cluster', palette='husl')
Note: From the summary statistics table above, the difference in average anomaly score between the outlier cluster and the normal cluster is high (3.8 compared to -0.12). Additionally, the average values of the other features also show big differences between the two clusters. This indicates that the data points in the outlier cluster are anomalous and need to be investigated.
# Combination by max
y_by_maximization = maximization(test_scores_norm)
import matplotlib.pyplot as plt
plt.hist(y_by_maximization, bins='auto', lw=0, color='mediumvioletred')
plt.xticks(size = 15)
plt.yticks(size = 15)
plt.title("Histogram for MOM Method",fontsize = 20)
plt.show()
Using the histogram to count the frequency of each anomaly score shows that the higher the score, the lower the frequency, which points to the group of outliers. As the chart shows the frequency dropping off somewhere below 5, I choose 3 as the cut point and treat scores >= 3 as outliers.
df_test = X_test.copy()
old_col = df_test.columns[:9]
feature_name = features.columns
df_test.rename(columns=dict(zip(old_col, feature_name)), inplace=True)
df_test['y_by_maximization_score'] = y_by_maximization
df_test['y_by_maximization_cluster'] = np.where(df_test['y_by_maximization_score']<3, 'Normal', 'Outlier')
df_test['y_by_maximization_cluster'].value_counts()
df_test.groupby('y_by_maximization_cluster').mean()
sns.pairplot(df_test.iloc[:,[1,2,3,4,9]], hue = 'y_by_maximization_cluster', palette='husl')
Note: The result of the maximization (MOM) method is similar to the average method. From the summary statistics table above, the difference in average anomaly score between the outlier cluster and the normal cluster is high (3.8 compared to -0.12). Additionally, the average values of the other features also show big differences between the two clusters. This indicates that the data points in the outlier cluster are anomalous and need to be investigated.
Isolation Forest is an unsupervised learning algorithm that computes an anomaly score for each observation and converts it into a binary label using an anomaly threshold. It constructs the separation by first building isolation trees (random decision trees); the score is then derived from the path length required to isolate an observation. Because Isolation Forest picks split points at random among randomly chosen variables, it reduces the chance of overfitting.
Isolation Forest is an advanced outlier detection method that detects anomalies based on the concept of isolation rather than distance or density measurements. This distinguishes it from methods such as KNN or PCA, and it is known to be effective at detecting fraud.
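As a toy illustration of the isolation idea (using scikit-learn's IsolationForest on synthetic data, not the pyod model fitted below), an isolated point needs fewer random splits to separate and therefore receives the lowest decision score:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
X_inliers = rng.normal(0, 1, size=(200, 2))   # dense cluster around the origin
X_outlier = np.array([[8.0, 8.0]])            # a clearly isolated point
X = np.vstack([X_inliers, X_outlier])

iso = IsolationForest(n_estimators=100, random_state=42).fit(X)
scores = iso.decision_function(X)             # lower score = easier to isolate

# the isolated point is separated in very few random splits,
# so it should receive the lowest decision score
is_most_anomalous = scores[-1] < scores[:-1].min()
print(is_most_anomalous)
```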
from pyod.models.iforest import IForest
clf1 = IForest(behaviour="new", bootstrap=True, n_jobs=-1)
clf1.fit(X_train)
# clf.decision_function: Predict raw anomaly score of X using the fitted detector.
# We apply the model to the test data X_test to get the outlier scores.
y_test_scores_1 = clf1.decision_function(X_test) # outlier scores
y_test_scores_1 = pd.Series(y_test_scores_1)
plt.hist(y_test_scores_1, bins='auto', lw=0, color='mediumvioletred')
plt.xticks(size = 15)
plt.yticks(size = 15)
plt.title("Histogram with 'auto' bins (Model 1)",fontsize = 20)
plt.show()
The histogram of anomaly-score frequencies shows that higher scores occur less often, which separates out the group of outliers. Since the frequency starts dropping just above 0.1, I choose 0.15 as the cut point and treat scores >= 0.15 as outliers.
df_test = X_test.copy()
old_col = df_test.columns[:9]
feature_name = features.columns
df_test.rename(columns=dict(zip(old_col, feature_name)), inplace=True)
df_test['score'] = y_test_scores_1
df_test['cluster'] = np.where(df_test['score']<0.15, 'Normal', 'Outlier')
df_test['cluster'].value_counts()
df_test.groupby('cluster').mean()
sns.pairplot(df_test.iloc[:,[1,2,3,4,9]], hue = 'cluster', palette='husl')
Note: From the summary statistics above, the average anomaly score differs sharply between the outlier cluster and the normal cluster. The average values of the other features also differ substantially, indicating that the points in the outlier cluster are anomalous and should be investigated.
clf2 = IForest(max_samples=50, bootstrap=True, n_jobs=-1)
clf2.fit(X_train)
# clf.decision_function: Predict raw anomaly score of X using the fitted detector.
# We apply the model to the test data X_test to get the outlier scores.
y_test_scores_2 = clf2.decision_function(X_test) # outlier scores
y_test_scores_2 = pd.Series(y_test_scores_2)
y_test_scores_2.head()
plt.hist(y_test_scores_2, bins='auto', lw=0, color='mediumvioletred')
plt.xticks(size = 15)
plt.yticks(size = 15)
plt.title("Histogram with 'auto' bins (Model 2)",fontsize = 20)
plt.show()
The histogram of anomaly-score frequencies shows that higher scores occur less often, which separates out the group of outliers. Since the frequency starts dropping around 0.1, I choose 0.1 as the cut point and treat scores >= 0.1 as outliers.
df_test = X_test.copy()
old_col = df_test.columns[:9]
feature_name = features.columns
df_test.rename(columns=dict(zip(old_col, feature_name)), inplace=True)
df_test['score'] = y_test_scores_2
df_test['cluster'] = np.where(df_test['score']<0.1, 'Normal', 'Outlier')
df_test['cluster'].value_counts()
df_test.groupby('cluster').mean()
sns.pairplot(df_test.iloc[:,[1,2,3,4,9]], hue = 'cluster', palette='husl')
Note: The results of model 2 are similar to those of model 1. From the summary statistics above, the average anomaly score differs sharply between the outlier cluster and the normal cluster. The average values of the other features also differ substantially, indicating that the points in the outlier cluster are anomalous and should be investigated.
clf3 = IForest(behaviour="new", max_samples=100, bootstrap=True, n_jobs=-1)
clf3.fit(X_train)
# clf.decision_function: Predict raw anomaly score of X using the fitted detector.
# We apply the model to the test data X_test to get the outlier scores.
y_test_scores_3 = clf3.decision_function(X_test) # outlier scores
y_test_scores_3 = pd.Series(y_test_scores_3)
y_test_scores_3.head()
plt.hist(y_test_scores_3, bins='auto', lw=0, color='mediumvioletred')
plt.xticks(size = 15)
plt.yticks(size = 15)
plt.title("Histogram with 'auto' bins (Model 3)",fontsize = 20)
plt.show()
The histogram of anomaly-score frequencies shows that higher scores occur less often, which separates out the group of outliers. Since the frequency starts dropping around 0.1, I choose 0.1 as the cut point and treat scores >= 0.1 as outliers.
df_test = X_test.copy()
old_col = df_test.columns[:9]
feature_name = features.columns
df_test.rename(columns=dict(zip(old_col, feature_name)), inplace=True)
df_test['score'] = y_test_scores_3
df_test['cluster'] = np.where(df_test['score']<0.1, 'Normal', 'Outlier')
df_test['cluster'].value_counts()
df_test.groupby('cluster').mean()
sns.pairplot(df_test.iloc[:,[1,2,3,4,9]], hue = 'cluster', palette='husl')
Note: The results of model 3 are similar to those of models 1 and 2. From the summary statistics above, the average anomaly score differs sharply between the outlier cluster and the normal cluster. The average values of the other features also differ substantially, indicating that the points in the outlier cluster are anomalous and should be investigated.
# The predictions of the training data can be obtained by clf.decision_scores_.
# It is already generated during the model building process.
train_scores = pd.DataFrame({'clf1': clf1.decision_scores_,
'clf2': clf2.decision_scores_,
'clf3': clf3.decision_scores_
})
# The predictions of the test data need to be predicted using clf.decision_function(X_test)
test_scores = pd.DataFrame({'clf1': clf1.decision_function(X_test),
'clf2': clf2.decision_function(X_test),
'clf3': clf3.decision_function(X_test)
})
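Before combining the three score columns, it can help to check how strongly the detectors agree; a hypothetical sketch using synthetic score columns in place of test_scores:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
# a shared "true" anomaly signal plus per-detector noise (assumed for illustration)
truth = np.concatenate([rng.normal(0, 1, 450), rng.normal(4, 1, 50)])
scores = pd.DataFrame({f"clf{i}": truth + rng.normal(0, 0.5, 500)
                       for i in (1, 2, 3)})

# high pairwise correlation suggests the detectors rank points similarly
print(scores.corr().round(2))
```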
# Although we did standardization before, it was for the variables.
# Now we do the standardization for the decision scores
train_scores_norm, test_scores_norm = standardizer(train_scores,test_scores)
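pyod's standardizer fits a z-score transform on the training scores and applies it to both sets; an equivalent sketch with scikit-learn's StandardScaler on synthetic score columns (the column names and distributions are assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
train = pd.DataFrame({"clf1": rng.normal(0.0, 0.05, 500),
                      "clf2": rng.normal(0.1, 0.08, 500)})
test = pd.DataFrame({"clf1": rng.normal(0.0, 0.05, 100),
                     "clf2": rng.normal(0.1, 0.08, 100)})

scaler = StandardScaler().fit(train)   # fit on the training scores only
train_norm = scaler.transform(train)   # training columns become mean 0, std 1
test_norm = scaler.transform(test)     # test scores reuse the training transform
```

Standardizing per column puts the three detectors' scores on a comparable scale, which is what makes averaging or taking the maximum across them meaningful.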
# Combination by average
y_by_average = average(test_scores_norm)
plt.hist(y_by_average, bins='auto', lw=0, color='mediumvioletred')
plt.xticks(size = 15)
plt.yticks(size = 15)
plt.title("Histogram for Average Method",fontsize = 20)
plt.show()
The histogram of anomaly-score frequencies shows that higher scores occur less often, which separates out the group of outliers. Since the frequency starts dropping around 3, I choose 3 as the cut point and treat scores >= 3 as outliers.
df_test = X_test.copy()
old_col = df_test.columns[:9]
feature_name = features.columns
df_test.rename(columns=dict(zip(old_col, feature_name)), inplace=True)
df_test['y_by_average_score'] = y_by_average
df_test['y_by_average_cluster'] = np.where(df_test['y_by_average_score']<3,'Normal', 'Outlier')
df_test['y_by_average_cluster'].value_counts()
df_test.groupby('y_by_average_cluster').mean()
sns.pairplot(df_test.iloc[:,[1,2,3,4,9]], hue = 'y_by_average_cluster', palette='husl')
Note: From the summary statistics above, the average anomaly score differs sharply between the outlier cluster and the normal cluster. The average values of the other features also differ substantially, indicating that the points in the outlier cluster are anomalous and should be investigated.
# Combination by max
y_by_maximization = maximization(test_scores_norm)
plt.hist(y_by_maximization, bins='auto', lw=0, color='mediumvioletred')
plt.xticks(size = 15)
plt.yticks(size = 15)
plt.title("Histogram for MOM Method",fontsize = 20)
plt.show()
The histogram of anomaly-score frequencies shows that higher scores occur less often, which separates out the group of outliers. Since the frequency starts dropping around 3, I choose 3 as the cut point and treat scores >= 3 as outliers.
df_test = X_test.copy()
old_col = df_test.columns[:9]
feature_name = features.columns
df_test.rename(columns=dict(zip(old_col, feature_name)), inplace=True)
df_test['y_by_maximization_score'] = y_by_maximization
df_test['y_by_maximization_cluster'] = np.where(df_test['y_by_maximization_score']<3,'Normal', 'Outlier')
df_test['y_by_maximization_cluster'].value_counts()
df_test.groupby('y_by_maximization_cluster').mean()
sns.pairplot(df_test.iloc[:,[1,2,3,4,9]], hue = 'y_by_maximization_cluster', palette='husl')
Note: The results of the MOM method are similar to those of the average method. From the summary statistics above, the average anomaly score differs sharply between the outlier cluster and the normal cluster. The average values of the other features also differ substantially, indicating that the points in the outlier cluster are anomalous and should be investigated.
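The similarity between the two combination rules can be checked directly; a sketch with synthetic standardized detector scores (a hypothetical stand-in for test_scores_norm):

```python
import numpy as np

rng = np.random.default_rng(7)
# three standardized detector score columns sharing one anomalous tail
base = np.concatenate([rng.normal(0, 1, 950), rng.normal(5, 1, 50)])
scores = np.column_stack([base + rng.normal(0, 0.3, 1000) for _ in range(3)])

y_avg = scores.mean(axis=1)   # combination by average
y_max = scores.max(axis=1)    # combination by maximization

labels_avg = np.where(y_avg < 3, "Normal", "Outlier")
labels_max = np.where(y_max < 3, "Normal", "Outlier")
agreement = (labels_avg == labels_max).mean()
print(f"label agreement: {agreement:.1%}")
```

Since the maximum is always at least the mean, maximization can only flag additional points; when the detectors are well correlated, the two label sets agree almost everywhere.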
After trying both the autoencoder and Isolation Forest techniques, I think that combining different approaches and comparing their results increases prediction confidence and reduces bias, especially since each parameter selection carries its own bias. As the results come out similar within each algorithm, my takeaway is to investigate the accounts flagged as outliers by the techniques used here.